Cache RoPE freqs on device to avoid repeated CPU-GPU copy in QwenImage #13406
akshan-main wants to merge 1 commit into huggingface:main
Conversation
The profiling was done with 2 steps, but this sync happens on every transformer forward call, so at 20 inference steps this eliminates ~1.5s of CPU-GPU sync overhead per run. Under torch.compile the impact is larger since GPU queues are deeper, so each sync stalls longer (80ms vs 76ms in eager).
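As a hedged illustration (this is not the PR's profiling tooling, and `freqs` is a stand-in for the real `pos_freqs`/`neg_freqs` tensors), you can watch the repeated per-call transfer show up in a profiler trace. On a CUDA machine each iteration would enqueue a fresh host-to-device Memcpy; on CPU we can still count the repeated `aten::to` dispatches:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in for the fixed-at-init frequency tensors.
freqs = torch.randn(4096, 64)
device = "cuda" if torch.cuda.is_available() else "cpu"

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        # One redundant transfer per "forward" call; copy=True forces
        # a real copy even when source and target device match.
        _ = freqs.to(device, copy=True)

to_calls = [e for e in prof.key_averages() if e.key == "aten::to"]
print(to_calls[0].count if to_calls else 0)
```

With the cache in place, only the first iteration would show a transfer; the remaining calls reuse the device-resident tensor.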
Oh, and this fix applies to all QwenImage variants (Edit, EditPlus, Layered) since they share the same transformer.
@akshan-main thanks for this! In the second plot, could you tell which one of the blocks the reported duration belongs to? |
The selected slice in the "after" image is the `transformer_forward` user_annotation itself (~439ms), wrapping the full `QwenImageTransformer2DModel.forward`. I'm highlighting the sub-block where the 76ms `cudaStreamSynchronize` (visible in the "before" screenshot) used to sit; it is gone now.
The ~439ms is for the entire `transformer_forward` block.
What does this PR do?
Part of #13401
`QwenEmbedRope.forward()` copies `pos_freqs` and `neg_freqs` from CPU to GPU via `.to(device)` on every transformer forward call. These tensors are fixed at init and never change, so the repeated transfer triggers an unnecessary `cudaStreamSynchronize` (~76ms each).

Added `_get_device_freqs()`, which caches the GPU copy on first call. Applied to both `QwenEmbedRope` and `QwenEmbedLayer3DRope`. (`register_buffer` can't be used here because it drops the imaginary part of complex tensors.)

Profiling (A100 80GB, eager, 2 steps, 1024x1024)
Before (76ms cudaStreamSynchronize inside transformer_forward):
After (no sync gap):
Profiled with the tooling from #13356. Reproduction notebook.
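The caching pattern the PR describes can be sketched roughly as below. This is an illustrative standalone module, not the actual diffusers code: the class name, dimensions, and frequency construction are made up, and only `_get_device_freqs` mirrors the PR's idea of lazily copying the complex tables to the target device once and reusing them afterward.

```python
import torch

class RopeFreqCache(torch.nn.Module):
    def __init__(self, dim: int = 64, max_len: int = 128):
        super().__init__()
        # Complex frequency tables built once at init (they live on CPU).
        # register_buffer is avoided since it drops the imaginary part.
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        angles = torch.outer(torch.arange(max_len).float(), inv_freq)
        self.pos_freqs = torch.polar(torch.ones_like(angles), angles)   # complex64
        self.neg_freqs = torch.polar(torch.ones_like(angles), -angles)
        self._cached = {}  # device string -> (pos, neg) on that device

    def _get_device_freqs(self, device: torch.device):
        # First call per device pays the transfer; later calls return the
        # cached tensors, so no CPU->GPU copy (and no cudaStreamSynchronize)
        # happens inside the per-step forward path.
        key = str(device)
        if key not in self._cached:
            self._cached[key] = (self.pos_freqs.to(device),
                                 self.neg_freqs.to(device))
        return self._cached[key]
```

Repeated calls for the same device return the identical tensor objects, which is what removes the per-forward sync from the trace.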
Who can review?
@sayakpaul @dg845